Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add debug instrumentation for test_play_services #1013

Merged
merged 4 commits into from
May 16, 2022

Conversation

Blast545
Copy link
Contributor

@Blast545 Blast545 commented May 12, 2022

Not a fix, but this PR should help with the test_play_services flaky tests described here: #862

Follow up discussion: #862 (comment)

Only reduced the number of msgs to 190 to confirm that rmw_fastrtps fails with specifically 195 iterations.
Only increased the timeout by 2 to avoid increasing the test time too much.

To test the PR I'll run two ci_linux jobs testing rosbag2_transport with the repeated jobs flags.

Signed-off-by: Jorge Perez [email protected]

@Blast545 Blast545 requested a review from a team as a code owner May 12, 2022 21:36
@Blast545 Blast545 requested review from gbiggs and jhdcs and removed request for a team May 12, 2022 21:36
@Blast545
Copy link
Contributor Author

Blast545 commented May 12, 2022

Testing with master + repeated flags on linux:
Build Status
Rebuild: Build Status

Testing with this PR + repeated flags on linux:
Build Status
Rebuild: Build Status

(adding a second round of CI because the error is flaky)

@jhdcs
Copy link
Contributor

jhdcs commented May 13, 2022

Perhaps I'm missing something, but this would just be masking an underlying problem that a test is identifying, right? Are we sure that we want that?

Sure, the issue may be from a separate repository, but this would be useful for providing data to the appropriate people in order to get the problem fixed...

@clalancette
Copy link
Contributor

Perhaps I'm missing something, but this would just be masking an underlying problem that a test is identifying, right? Are we sure that we want that?

I think we should do both. This test has been failing for quite a long time, and it makes the lives of all ROS 2 developers difficult. So working around it is the right thing for the short-term, while a bug report to rmw_fastrtps is the right thing for the long term.

@Blast545
Copy link
Contributor Author

Perhaps I'm missing something, but this would just be masking an underlying problem that a test is identifying, right? Are we sure that we want that?

@jhdcs I'm not sure if this is what we want, I opened the PR with the intention of gathering more info from CI and receiving feedback from maintainers + ROS2 team as to how to proceed.

This flaky error is affecting 50% of our https://build.ros2.org/view/Rci/job/Rci__nightly-release_ubuntu_jammy_amd64/ builds atm, and hasn't been addressed in a long time. We should do something.

@jhdcs
Copy link
Contributor

jhdcs commented May 13, 2022

Perhaps I'm missing something, but this would just be masking an underlying problem that a test is identifying, right? Are we sure that we want that?

@jhdcs I'm not sure if this is what we want, I opened the PR with the intention of gathering more info from CI and receiving feedback from maintainers + ROS2 team as to how to proceed.

This flaky error is affecting 50% of our https://build.ros2.org/view/Rci/job/Rci__nightly-release_ubuntu_jammy_amd64/ builds atm, and hasn't been addressed in a long time. We should do something.

I agree that something should be done. And thank you for stepping up to the plate. I just don't want this PR to be used as an excuse for not fixing the underlying problem - which doesn't appear to be an issue here.

@MichaelOrlov
Copy link
Contributor

@Blast545 @jhdcs @clalancette I would suggest some compromise solution.
I think at the moment it will be reasonable to try to increase wait timeout for the service calls and add some more debug information in tests.
I have tried to analyze this issue in more details last night comparing with other failing cases. I've traced all the pass till when we are sending request to via DDS data writer. And this path looks good for me. What I have not verified is the way how we are getting confirmation about response from service call.

What I've found is that mostly failing two tests PlaySrvsTest.toggle_paused and PlaySrvsTest.play_next. and mostly with rmw_fastrtps_dynamic_cpp. However I've seen the same failures with CycloneDDS.
It telling us that most likely the problem unrelated to the some specific DDS implementation or we incorrectly rely on some assumption which is true for both DDS implementations.

I have suspicious that there are some problem with timings or QoS. Although not sure in what place exactly. The higher failure rate with rmw_fastrtps_dynamic_cpp could be just because it's a bit slowly I guess.

I've tried to think about what is common in those two PlaySrvsTest.toggle_paused and PlaySrvsTest.play_next tests to cause a failure.
From Rosbag2::Player perspective of view almost nothing. PlaySrvsTest.play_next could take longer in response since it invoke publishing messages inside. However PlaySrvsTest.toggle_paused should be pretty fast there are almost nothing inside on Player side just lock for one mutex and toggling a flag.
I think that common in those two tests is the returning boolean value in play_next returning result true or false.
Toggle paused returning void. Although in test after each call for toggle_pause we are requesting status via another service call is_paused
https://github.com/ros2/rosbag2/blob/0.16.0/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L116-L120
https://github.com/ros2/rosbag2/blob/0.16.0/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L273
which is also returning boolean value.

I have a suspicious that failure happening for cases when service call has returning value.

@jhdcs
Copy link
Contributor

jhdcs commented May 13, 2022

@Blast545 @jhdcs @clalancette I would suggest some compromise solution.
I think at the moment it will be reasonable to try to increase wait timeout for the service calls and add some more debug information in tests.

This feels very reasonable to me. Just in case it's not clear, my only concern is if the underlying issue is never addressed. As long as we're making concrete steps to identify and correct the issue, creating some stopgap measures is perfectly fine!

From the rest of your post... It almost sounds like we may have a race condition going on... Does anyone else get that feeling?

- Increased service_call_timeout_ up to 3 seconds.
- Split `toggle_paused` on two separate tests `is_paused` and
`toggle_paused`
- Add output of the function name and line number in case of failure.

Signed-off-by: Michael Orlov <[email protected]>
@MichaelOrlov
Copy link
Contributor

@Blast545 I pushed commit with additional debug info.
Could you please try to re-run CI?

@Blast545
Copy link
Contributor Author

@MichaelOrlov Sure thing!

Repeated + PR on Linux
Build Status

From the initial CI that I ran, I see that the second repeated + PR failed with rmw_fastrtps, potentially invalidating the theory of the 195 meaning something internally for fastrtps.
https://ci.ros2.org/job/ci_linux/16705/

@MichaelOrlov
Copy link
Contributor

From the initial CI that I ran, I see that the second repeated + PR failed with rmw_fastrtps, potentially invalidating the theory of the 195 meaning something internally for fastrtps

Totally agree.

@MichaelOrlov
Copy link
Contributor

MichaelOrlov commented May 13, 2022

@Blast545 What I can see is that test running twice per build and each time it fails on the same step or iteration.
The iteration could vary from build to build but it's very consistent between tests run from the same build.
For example in one build it fails twice in play_next on 95 iteration https://ci.ros2.org/job/ci_linux/16709/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/ on another build it fails twice on 118 iteration https://ci.ros2.org/job/ci_linux/16709/testReport/junit/(root)/projectroot/test_play_services__rmw_fastrtps_cpp/ and I have seen the same behavior for toggle_paused.
And please note that it fails on both CycloneDDS and FastRTPS.

It's actually usually pointing out that we have somewhere a dangling pointer or dangling reference which is getting overwritten at some iteration depending from the current memory layout.

I am unfortunately don't have more time to debug this issue deeper. And I am not very familiar with implementation of the service calls and SingleThreadedExecutor. We need to find someone who can continue analysis of this issue from this point.

@wjwwood Could you please help with debugging failure in service calls? Or please advise someone who can help with it.

@MichaelOrlov MichaelOrlov self-requested a review May 13, 2022 23:58
Copy link
Contributor

@MichaelOrlov MichaelOrlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Blast545 @clalancette I would suggest to merge this PR with added debug instrumentation and increased wait time for service call.

I would like to see if it will still fail on CI in toggle_paused or in new is_paused test.

@wjwwood
Copy link
Member

wjwwood commented May 14, 2022

@wjwwood Could you please help with debugging failure in service calls? Or please advise someone who can help with it.

Yeah, I can help. It might be Monday though before I can catch up on the thread. I'd be happy to chat out of band on Monday as well.

Signed-off-by: Jorge Perez <[email protected]>
@Blast545 Blast545 changed the title Increase test timeout and decrease msgs sent Add debug instrumentation for test_play_services May 16, 2022
@Blast545
Copy link
Contributor Author

I changed the title of this PR to better reflect the debug info added in @MichaelOrlov 's commit.

Also restored the msgs_to_publish from 190 to 200 as the number 195 wasn't representative as we originally thought.

@Blast545
Copy link
Contributor Author

Full CI: --packages-up-to rosbag2_transport + packages-select rosbag2_transport

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

Add redefinition for __PRETTY_FUNCTION__ to __FUNCSIG__ if it is not
defined and not gcc compiler.

Signed-off-by: Michael Orlov <[email protected]>
@MichaelOrlov
Copy link
Contributor

@Blast545 Sorry I forgot that __PRETTY_FUNCTION__ doesn't exist on Windows. This is why CI failed on windows.
I've pushed workaround. Will see if it will fix the windows build.

@Blast545
Copy link
Contributor Author

Blast545 commented May 16, 2022

Silly me, I ran the launcher with a wrong configuration. Re running:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@MichaelOrlov MichaelOrlov merged commit e140329 into master May 16, 2022
@delete-merged-branch delete-merged-branch bot deleted the blast545/test_play_services_improvement branch May 16, 2022 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants